Many alternatives to classical frequentist statistics, and to hypothesis testing in particular, have been proposed:
Bayesian statistics
Lowering the significance threshold from \(0.05\) to \(0.005\)
Focus on effect sizes and uncertainty intervals
My opinion: \(P\)-values are here to stay
Instead, improve the interpretation of the tools that are already in use
\(P\)-value functions: a graphical tool to achieve that
\(P\)-values
Definition
\(P\) is the probability, under the null hypothesis (and all other assumptions of the statistical model), of obtaining a test statistic at least as extreme as the one actually observed. “Extreme” means farther from the null value.
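A minimal worked example of the definition, assuming a hypothetical estimate of \(1.2\) with standard error \(0.5\) and a test statistic that is approximately standard normal under the null hypothesis. The two-sided \(p\)-value is the probability of a \(|z|\) at least as large as the observed one:

```r
# Hypothetical estimate and standard error (illustrative values only)
estimate <- 1.2
std_error <- 0.5
null_value <- 0

# Wald z-statistic: distance from the null value in standard-error units
z <- (estimate - null_value) / std_error

# Two-sided p-value: probability under the null of |Z| >= |z|
p_value <- 2 * pnorm(-abs(z))
p_value
```

Here \(z = 2.4\) and the \(p\)-value is about \(0.016\).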
Informally, \(p\)-values are a continuous measure of compatibility between the data and a hypothesis, given a set of background information.
Just one \(p\)-value?
“The null hypothesis states that there is no effect/no difference.” No!
\(P\)-values in papers often rest on implicit hypotheses: \(\operatorname{H}_{0}:\theta = \color{red}{0}\) vs. \(\operatorname{H}_{1}:\theta\neq \color{red}{0}\).
In other words, the \(p\)-value is for a test against the null value \(0\).
But you can calculate \(p\)-values for other null values too!
Example: \(\operatorname{H}_{0}:\theta = \color{red}{2.5}\) vs. \(\operatorname{H}_{1}:\theta\neq \color{red}{2.5}\).
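A minimal sketch of testing the same estimate against different null values, assuming a normal (Wald) approximation. The helper p_at and the estimate of \(2\) with standard error \(1\) are hypothetical and serve only as illustration:

```r
# Two-sided Wald p-value for H0: theta = null_value (normal approximation assumed)
p_at <- function(null_value, estimate, std_error) {
  z <- (estimate - null_value) / std_error
  2 * pnorm(-abs(z))
}

# Hypothetical estimate and standard error
estimate <- 2
std_error <- 1

p_at(0, estimate, std_error)    # test against theta = 0:   about 0.046
p_at(2.5, estimate, std_error)  # test against theta = 2.5: about 0.62

# Evaluating many null values at once traces out a p-value function
null_grid <- seq(-1, 5, by = 0.5)
p_at(null_grid, estimate, std_error)
```

The \(p\)-value reaches its maximum of \(1\) at the point estimate and shrinks as the null value moves away from it in either direction.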
Confidence intervals and their relation to \(p\)-values
“Our results are most compatible at the 95% level with an effect of high versus low amounts of sitting anywhere from an 8% hazard reduction to 55% increase in the hazard of diabetes.”
Based on the results, do the data favor \(\operatorname{RR} = 1\) or \(\operatorname{RR} = 2\)?
The point estimate is \(1.5\): Closer to \(2\) than to \(1\) in proportional terms.
The \(p\)-value for \(\operatorname{RR} = 2\) is \(0.20\), three times the \(p\)-value for \(\operatorname{RR} = 1\).
\(1\) is closer to the lower CI limit than \(2\) is to the upper limit (in proportional terms).
Summary: Despite “nonsignificance” and power approximating 90% for \(\operatorname{RR} = 2\), the results favor \(\operatorname{RR} = 2\) over \(\operatorname{RR} = 1\)! (Greenland, 2012)
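The comparison can be reproduced on the log scale, where Wald intervals for ratio measures are symmetric. A minimal sketch, assuming a normal approximation for \(\log(\operatorname{RR})\). The function name p_for_rr is hypothetical, and because the CI limits of the example are not stated here, the inputs (estimate \(1.5\), 95% CI \(0.97\) to \(2.32\)) are assumed values chosen only to be consistent with the \(p\)-values quoted above:

```r
# Two-sided p-value for H0: RR = null_rr, computed on the log scale
# from a point estimate and a confidence interval (Wald approximation assumed)
p_for_rr <- function(null_rr, rr_hat, lower, upper, level = 0.95) {
  z_crit <- qnorm(1 - (1 - level) / 2)
  se_log <- (log(upper) - log(lower)) / (2 * z_crit)  # SE of log(RR)
  z <- (log(rr_hat) - log(null_rr)) / se_log
  2 * pnorm(-abs(z))
}

# Hypothetical inputs: estimate 1.5 with an assumed 95% CI of 0.97 to 2.32
p_null <- p_for_rr(1, rr_hat = 1.5, lower = 0.97, upper = 2.32)
p_alt  <- p_for_rr(2, rr_hat = 1.5, lower = 0.97, upper = 2.32)
c(p_null = p_null, p_alt = p_alt, ratio = p_alt / p_null)
```

With these inputs the \(p\)-values come out at roughly \(0.07\) for \(\operatorname{RR} = 1\) and \(0.20\) for \(\operatorname{RR} = 2\), a ratio of about \(3\).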
The function p_function creates \(p\)-value functions directly from fitted models. Here is an example for a linear regression model:
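```r
library(parameters)  # provides p_function(); plot() also needs the see package
mod <- lm(mpg ~ wt + as.factor(gear) + am, data = mtcars)
# The interval level named "emph" is highlighted in the plot
p_curve <- p_function(mod, ci_levels = c(emph = 0.95))
plot(p_curve, n_columns = 2)
```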
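To see what such a curve contains, here is a minimal base-R sketch for the wt coefficient of the same model. It is not how p_function works internally; it simply evaluates the usual two-sided \(t\)-test \(p\)-value over a grid of null values (the names b0_grid and p_curve_wt are hypothetical). It also illustrates the link to confidence intervals: the curve crosses \(p = 0.05\) exactly at the 95% CI limits reported by confint():

```r
# Fit the same model as above (mtcars ships with base R)
mod <- lm(mpg ~ wt + as.factor(gear) + am, data = mtcars)

est    <- coef(mod)[["wt"]]
se     <- sqrt(diag(vcov(mod)))[["wt"]]
res_df <- df.residual(mod)

# Two-sided t-test p-value for H0: beta_wt = b0, for every b0 on a grid
b0_grid <- seq(est - 4 * se, est + 4 * se, length.out = 400)
p_curve_wt <- 2 * pt(-abs((est - b0_grid) / se), df = res_df)

plot(b0_grid, p_curve_wt, type = "l",
     xlab = "Null value for the wt coefficient", ylab = "p-value")
abline(h = 0.05, lty = 2)  # reference line at p = 0.05

# Check: the p-value at each 95% CI limit from confint() equals 0.05
ci <- confint(mod, "wt", level = 0.95)
p_at_limits <- 2 * pt(-abs((est - ci) / se), df = res_df)
p_at_limits
```

Every null value inside the 95% CI has \(p > 0.05\), and every value outside it has \(p < 0.05\); the p-value function shows all interval levels at once.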
Take-Home-Messages
Dichotomizing results into “significant” and “not significant” is not informative: the difference between the two is not itself statistically significant (Gelman & Stern, 2006).
Focus on effect sizes and confidence/compatibility intervals.
Create \(p\)-value functions to summarize the available evidence:
To show values with high and low compatibility with the data.
To compare evidence from different studies.
Never say that “we found no effect/association” or “there was no difference” when \(p > 0.05\) (Greenland et al., 2016).
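The point about dichotomization can be checked with simple arithmetic. A minimal sketch in the spirit of Gelman & Stern (2006), assuming two hypothetical independent studies with normally distributed estimates (\(25\) and \(10\), each with standard error \(10\)):

```r
# Two hypothetical independent studies: estimate and standard error each
est <- c(study_a = 25, study_b = 10)
se  <- c(study_a = 10, study_b = 10)

# Individual two-sided p-values against a null value of 0
p_individual <- 2 * pnorm(-abs(est / se))

# The difference between the studies and its standard error
diff_est <- est[["study_a"]] - est[["study_b"]]
diff_se  <- sqrt(se[["study_a"]]^2 + se[["study_b"]]^2)
p_difference <- 2 * pnorm(-abs(diff_est / diff_se))

round(c(p_individual, difference = p_difference), 3)
```

Study A is “significant” (\(p \approx 0.012\)) and study B is not (\(p \approx 0.32\)), yet the difference between them has \(p \approx 0.29\): the two studies are quite compatible with each other.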
Recommended reading
Bender, R., Berg, G., & Zeeb, H. (2005). Tutorial: Using Confidence Curves in Medical Research. Biometrical Journal, 47(2), 237–247. https://doi.org/10.1002/bimj.200410104
Gelman, A., & Stern, H. (2006). The Difference Between “Significant” and “Not Significant” is not Itself Statistically Significant. The American Statistician, 60(4), 328–331. https://doi.org/10.1198/000313006X152649
Greenland, S. (2012). Nonsignificance Plus High Power Does Not Imply Support for the Null Over the Alternative. Annals of Epidemiology, 22(5), 364–368. https://doi.org/10.1016/j.annepidem.2012.02.007
Greenland, S. (2019). Valid P-Values Behave Exactly as They Should: Some Misleading Criticisms of P-Values and Their Resolution With S-Values. The American Statistician, 73(sup1), 106–114. https://doi.org/10.1080/00031305.2018.1529625
Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. European Journal of Epidemiology, 31(4), 337–350. https://doi.org/10.1007/s10654-016-0149-3
Infanger, D., & Schmidt-Trucksäss, A. (2019). P value functions: An underused method to present research results and to promote quantitative reasoning. Statistics in Medicine, 38(21), 4189–4197. https://doi.org/10.1002/sim.8293
Rafi, Z., & Greenland, S. (2020). Semantic and cognitive tools to aid statistical science: Replace confidence and significance by compatibility and surprise. BMC Medical Research Methodology, 20(1), 244. https://doi.org/10.1186/s12874-020-01105-9